[Chapter 6] Applying PCA to the nutrition data

[DSLC stages]: Analysis

In this document, you will find the a Principal Component Analysis (PCA) of the food nutrition data. Our goal is to come up with some meaningful “summary nutrient” variables that can be used to summarize the nutrient variables in each nutrient group.

We will first conduct an analysis for just the major minerals nutrient group data, and will then use some advanced R code to automate the principal component analysis across each nutrient group.

Note also, that when defining functions, we often define the arguments with a period (.) preceding the argument name, such as function(.data, .argument1, .argument2). This is purely to make it easier to see what is an external object and what is an argument of the function when looking at the code (objects with a preceding period only exist within the function “environment”).

The following code sets up the libraries and creates cleaned and pre-processed data that we will use in this document.

library(tidyverse)
library(patchwork)

# source the cleaning function code
source("functions/cleanFoodData.R") 
source("functions/preProcessFoodData.R")
# load in all of the raw data objects
nutrient_amount <- read_csv("../data/food_nutrient.csv")
food_name <- read_csv("../data/food.csv")
nutrient_name <- read_csv("../data/nutrient_name.csv")

# Clean the FNDDS data
food_fndds <- cleanFoodData(.nutrient_amount_data = nutrient_amount, 
                            .food_name_data = food_name,
                            .nutrient_name_data = nutrient_name, 
                            # fndds is the default value
                            .select_data_type = "survey_fndds_food")
# Preprocess the FNDDS data
food_fndds_scaled <- preProcessFoodData(food_fndds, 
                                        # these are the default args
                                        .log_transform = FALSE,
                                        .center = FALSE,
                                        .scale = TRUE,
                                        .remove_fat = FALSE)
food_fndds_log_scaled <- preProcessFoodData(food_fndds, 
                                            # these are the default args
                                            .log_transform = TRUE,
                                            .center = FALSE,
                                            .scale = TRUE,
                                            .remove_fat = FALSE)
# Clean the Legacy data
food_legacy <- cleanFoodData(.nutrient_amount_data = nutrient_amount, 
                             .food_name_data = food_name,
                             .nutrient_name_data = nutrient_name, 
                             .select_data_type = "sr_legacy_food")
# Preprocess the Legacy data
food_legacy_log_scaled <- preProcessFoodData(food_legacy,
                                             # these are the default args
                                             .log_transform = TRUE,
                                             .center = FALSE,
                                             .scale = TRUE,
                                             .remove_fat = FALSE)

1 Applying principal component analysis to the full dataset

Let’s start by applying simple PCA algorithm to the full food (FNDDS) dataset and see what happens. We will for now just focus on the scaled and un-centered (i.e. each variable has only been divided by its standard deviation).

Ideally the components we identify will each correspond to a “primary” type of nutrient (spoiler: this doesn’t work!).

First, we conduct SVD on the scaled, un-centered dataset.

pca <- food_fndds_scaled |>
  select(-description) |>
  svd()

The code below shows that the first principal component explains 34% of the total variability (relative to zero) in the data.

prop_var <- pca$d^2 / sum(pca$d^2)
tibble(PC = 1:length(prop_var),
       `Variability explained` = round(prop_var, 2),
       `Cumulative variability explained` = round(cumsum(prop_var), 2)) |>
  head(5)
# A tibble: 5 × 3
     PC `Variability explained` `Cumulative variability explained`
  <int>                   <dbl>                              <dbl>
1     1                    0.34                               0.34
2     2                    0.09                               0.43
3     3                    0.07                               0.5 
4     4                    0.06                               0.56
5     5                    0.05                               0.61

The following table shows the 10 variables with the highest loading on PC1. PC1 mostly seems to be capturing fats.

# extract the loadings
loadings <- pca$v
# look at the 10 largest (absolute value) loadings for PC1
tibble(nutrient = colnames(select(food_fndds_scaled, -description)),
       pc1_loading = loadings[, 1]) |>
  # arrange in absolute value descending order
  arrange(desc(abs(pc1_loading))) |> 
  head(10)
# A tibble: 10 × 2
   nutrient      pc1_loading
   <chr>               <dbl>
 1 moisture           -0.337
 2 calories           -0.292
 3 phosphorus         -0.233
 4 potassium          -0.227
 5 protein            -0.220
 6 sodium             -0.214
 7 fat                -0.196
 8 palmitic_acid      -0.186
 9 carbohydrates      -0.180
10 magnesium          -0.176

The following table shows the 10 variables with the highest loading on PC2. They mostly seem to be vitamins (negative loading), but some fats also appear.

data.frame(nutrient = colnames(select(food_fndds_scaled, -description)),
           pc2_loading = loadings[, 2]) |>
  arrange(desc(abs(pc2_loading))) |>
  head(10)
              nutrient pc2_loading
1        saturated_fat   0.2784364
2             moisture  -0.2688873
3        palmitic_acid   0.2441884
4        myristic_acid   0.2441453
5          capric_acid   0.2334087
6                  fat   0.2302012
7         stearic_acid   0.2295276
8         caproic_acid   0.2178821
9         butyric_acid   0.2073644
10 monounsaturated_fat   0.1830177

There is clearly some kind of “nutrient type” being captured by these two PCs, but the fact that the vast majority of the variability in the data is contained within the first PC, and that each PC involves variables from across several different “categories” (e.g., PC2 contains fats and vitamins) means that they will not be particularly useful for creating summary variables of individual nutrient types.

Regardless, let’s take a look at these principal components in terms of how the foods seem to be distributed in PC space.

First, we need to create a version of the nutrient dataset that has been transformed into PC space. This involves multiplying the original data matrix with the PC loading matrix. The resulting dataset (pca_scaled_x) has one row for each food item, but instead of the nutrients as columns, it has the PCs.

# create the PCA-transformed dataset
# first, create a numeric matrix version of the original data
x <- food_fndds_scaled |>
  select(-description) |>
  as.matrix()
# then multiply the original data and the PCA loadings
pca_scaled_x <- (x %*% pca$v) |>
  # convert to a tibble
  as_tibble()
# make the tibble pretty and easier to work with by
# changing the column names to PC1, PC2, etc
colnames(pca_scaled_x) <- paste0("PC", 1:ncol(pca_scaled_x))

# add back the food descriptions column
pca_scaled_x <- pca_scaled_x |>
  mutate(description = food_fndds_scaled$description) |>
  # reorder the rows
  select(description, everything())

# look at the object
head(pca_scaled_x)
# A tibble: 6 × 58
  description   PC1     PC2   PC3   PC4    PC5   PC6    PC7    PC8   PC9    PC10
  <chr>       <dbl>   <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl> <dbl>   <dbl>
1 Milk, human -2.44 -0.150   1.98 0.883 0.0441 0.514 -0.102 -0.987 0.851 -0.0278
2 Milk, NFS   -2.59 -0.635   1.83 1.20  0.537  0.562 -0.412 -0.606 0.508 -0.497 
3 Milk, whole -2.76 -0.190   1.83 1.30  0.754  0.480 -0.381 -0.647 0.544 -0.482 
4 Milk, low … -2.89 -0.0887  1.98 1.30  0.711  0.502 -0.456 -0.526 0.382 -0.629 
5 Milk, calc… -2.76 -0.190   1.83 1.30  0.754  0.480 -0.381 -0.647 0.544 -0.482 
6 Milk, calc… -2.41 -1.03    1.84 1.11  0.329  0.625 -0.472 -0.550 0.515 -0.430 
# ℹ 47 more variables: PC11 <dbl>, PC12 <dbl>, PC13 <dbl>, PC14 <dbl>,
#   PC15 <dbl>, PC16 <dbl>, PC17 <dbl>, PC18 <dbl>, PC19 <dbl>, PC20 <dbl>,
#   PC21 <dbl>, PC22 <dbl>, PC23 <dbl>, PC24 <dbl>, PC25 <dbl>, PC26 <dbl>,
#   PC27 <dbl>, PC28 <dbl>, PC29 <dbl>, PC30 <dbl>, PC31 <dbl>, PC32 <dbl>,
#   PC33 <dbl>, PC34 <dbl>, PC35 <dbl>, PC36 <dbl>, PC37 <dbl>, PC38 <dbl>,
#   PC39 <dbl>, PC40 <dbl>, PC41 <dbl>, PC42 <dbl>, PC43 <dbl>, PC44 <dbl>,
#   PC45 <dbl>, PC46 <dbl>, PC47 <dbl>, PC48 <dbl>, PC49 <dbl>, PC50 <dbl>, …

Now we can visualize our foods in PC space. Figure 1 shows the distribution of foods in the (PC1, PC2) space. Overall, it seems as though there is very little information in PC2 (the spread is mostly in the PC1 direction).

# plot PC1 vs PC2
pca_scaled_x |>
  ggplot() +
  geom_point(aes(x = PC1, y = PC2), alpha = 0.3, color = "grey60") +
  geom_text(aes(x = PC1, y = PC2, label = str_trunc(description, 15)), 
            check_overlap = TRUE, hjust = 0)

Figure 1: A scatterplot of PC1 vs PC2

While it certainly seems as though these principal components are creating a space in which similar food items are close together, these principal components still aren’t going to be helpful for generating intuitive summary variables for each nutrient group since every nutrient variable contributes to each principal component, just with different weights.

2 Defining nutrient groups

Thus, let’s shift our goal for this project to create a summary variable separately for each nutrient group. That is, we want to apply PCA separately to group of nutrient variables that we identified in the EDA doc.

We will aim to create a summary variable based on the first Principal Component computed separately for each of the following groups:

  1. Vitamins: vitamin C, vitamin B6, vitamin B12, riboflavin, thiamine, folate, niacin, beta carotene, alpha carotene, lutein zeaxanthin, phylloquinone, alpha tocopherol, retinol, lycopene, cryptoxanthin

  2. Fats: fat, saturated fat, monounsaturated fat, polyunsaturated fat, cholesterol, and all of the fatty acids

  3. Major minerals: sodium, potassium, calcium, phosphorus, magnesium, total choline

  4. Trace minerals: iron, zinc, selenium, copper

  5. Carbohydrates: carbohydrates, total dietary fiber

  6. Calories: calories

  7. Protein: protein

Since the last two groups (calories and protein) only consist of one nutrient each, we don’t need to apply PCA to summarize these groups.

3 Applying PCA to the major minerals data only

Let’s focus first just on the scaled but un-centered major minerals nutrient dataset that we explored in Chapter 7.

food_fndds_scaled_major_minerals <- food_fndds_scaled |>
  select(description, sodium, potassium, calcium, 
         phosphorus, magnesium, total_choline)

Then we can apply PCA to it using SVD:

major_minerals_pca <- food_fndds_scaled_major_minerals |>
  # remove the description character variable
  select(-description) |>
  # apply svd
  svd()
# print out the names of the objects contained within major_minerals_pca
names(major_minerals_pca)
[1] "d" "u" "v"

Let’s look at the right-singular vector matrix for the major minerals dataset. The rows correspond to the major minerals, and the columns correspond to the PCs.

# extract the right-singular vectors
major_minerals_v <- major_minerals_pca$v
# add column and rownames
colnames(major_minerals_v) <- paste0("PC", 1:ncol(major_minerals_v))
rownames(major_minerals_v) <- colnames(food_fndds_scaled_major_minerals)[-1]
# print the matrix V (rounded to 3dp)
round(major_minerals_v, 3)
                 PC1    PC2    PC3    PC4    PC5    PC6
sodium        -0.424 -0.633 -0.226  0.579 -0.151  0.100
potassium     -0.484  0.279  0.429  0.300  0.635 -0.101
calcium       -0.290  0.238 -0.785 -0.215  0.350  0.271
phosphorus    -0.486  0.065 -0.121 -0.281 -0.304 -0.757
magnesium     -0.377  0.508  0.158  0.106 -0.594  0.459
total_choline -0.351 -0.451  0.329 -0.661  0.076  0.349

3.1 Create the PCS-transformed dataset

To create the PCA-transformed dataset, we need to multiply the original data by this matrix \(V\)

# get a matrix version of the original major minerals dataset (X)
major_minerals_x <- food_fndds_scaled_major_minerals |>
  select(-description) |>
  as.matrix()
# compute the PCA transformation
pca_major_minerals_x <- major_minerals_x %*% major_minerals_v

# add the description 
pca_major_minerals_x <- pca_major_minerals_x |>
  as_tibble() |>
  mutate(description = food_fndds_scaled_major_minerals$description, .before = "PC1")
pca_major_minerals_x
# A tibble: 8,690 × 7
   description                         PC1     PC2      PC3    PC4   PC5     PC6
   <chr>                             <dbl>   <dbl>    <dbl>  <dbl> <dbl>   <dbl>
 1 Milk, human                      -0.459 -0.0202  0.00547 -0.236 0.203  0.144 
 2 Milk, NFS                        -1.34   0.395  -0.468   -0.367 0.426 -0.0678
 3 Milk, whole                      -1.25   0.372  -0.454   -0.335 0.410 -0.0599
 4 Milk, low sodium, whole          -1.43   0.510  -0.0856  -0.245 0.868 -0.227 
 5 Milk, calcium fortified, whole   -1.25   0.372  -0.454   -0.335 0.410 -0.0599
 6 Milk, calcium fortified, low fa… -1.41   0.404  -0.478   -0.407 0.467 -0.0672
 7 Milk, calcium fortified, fat fr… -1.67   0.600  -1.02    -0.505 0.744  0.0653
 8 Milk, reduced fat (2%)           -1.35   0.385  -0.475   -0.379 0.423 -0.0654
 9 Milk, acidophilus, low fat (1%)  -1.41   0.404  -0.478   -0.407 0.467 -0.0672
10 Milk, acidophilus, reduced fat … -1.35   0.385  -0.475   -0.379 0.423 -0.0654
# ℹ 8,680 more rows

Let’s extract the principal components for the “Herring, smoked, kippered” food item.

pca_major_minerals_x |> 
  filter(description == "Herring, smoked, kippered")
# A tibble: 1 × 7
  description                 PC1   PC2   PC3    PC4    PC5    PC6
  <chr>                     <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1 Herring, smoked, kippered -5.17 -1.22 0.395 0.0194 -0.124 -0.382

3.2 Visualizing food items in PC space

Next, we can look at a sample of 500 food items in PC1-PC2 space.

set.seed(23178989)
pca_major_minerals_x |>
  sample_n(500) |>
  ggplot() +
  geom_point(aes(x = PC2, y = -PC1), alpha = 0.2, col = "grey50") +
  geom_text(aes(x = PC2, y = -PC1, 
                # show just the first word of the description
                label = map_chr(description, ~str_split(., " ", simplify = TRUE)[1])), 
            check_overlap = TRUE, alpha = 0.9, nudge_x = 0.5, hjust = 0)

Note that in the book we consider the negative of PC1, so we negate the PC1 axis here for comparability.

Similar food items seem to be similarly positioned in PC1-PC2 space.

3.3 Proportion of variability explained

We can compute the proportion of variability explained by each component using the singular value matrix D, which is

diag(major_minerals_pca$d)
         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
[1,] 252.8375  0.00000  0.00000  0.00000  0.00000  0.00000
[2,]   0.0000 97.36215  0.00000  0.00000  0.00000  0.00000
[3,]   0.0000  0.00000 91.12533  0.00000  0.00000  0.00000
[4,]   0.0000  0.00000  0.00000 80.75718  0.00000  0.00000
[5,]   0.0000  0.00000  0.00000  0.00000 62.30406  0.00000
[6,]   0.0000  0.00000  0.00000  0.00000  0.00000 45.01052

The proportion of variability explained (relative to zero because we didn’t center the data) for each component is:

(major_minerals_pca$d^2) / sum(major_minerals_pca$d^2)
[1] 0.67906468 0.10069516 0.08820770 0.06927724 0.04123452 0.02152070

And we can create a scree plot too:

tibble(PC = 1:length(major_minerals_pca$d),
       prop_explained = (major_minerals_pca$d^2) / sum(major_minerals_pca$d^2)) |>
  ggplot() +
  geom_line(aes(x = PC, y = prop_explained))

There is a very clear elbow at the second PC, which implies that taking the first (and maybe second) PC corresponds to a reasonable summary of the dataset.

3.4 Log-transformation

The results above did not involve computing a log-transformation of any of the variables. But as we saw in 02_eda.qmd, the symmetry of many of the variables’ distributions were much improved by computing a log-transformation. It may also be the case that PCA is able to compute more meaningful components if we do compute a log-transformation.

While we could relegate this section to a stability assessment, since we believe that our results might actually be improved by conducting a log-transformation, we will do an in-depth assessment here (if we didn’t think the log-transformation would make any difference, we would instead do this exploration as a stability analysis).

First, let’s create the log-transformed scaled major minerals dataset

food_fndds_log_scaled_major_minerals <- food_fndds_log_scaled |>
  select(description, sodium, potassium, calcium, 
         phosphorus, magnesium, total_choline)

Then we can apply PCA to it using SVD:

major_minerals_log_pca <- food_fndds_log_scaled_major_minerals |>
  # remove the description character variable
  select(-description) |>
  # apply svd
  svd()

Let’s look at the right-singular vector matrix for the major minerals dataset. The rows correspond to the major minerals, and the columns correspond to the PCs.

# extract the right-singular vectors
major_minerals_log_v <- major_minerals_log_pca$v
# add column and rownames
colnames(major_minerals_log_v) <- paste0("PC", 1:ncol(major_minerals_log_v))
rownames(major_minerals_log_v) <- colnames(food_fndds_log_scaled_major_minerals)[-1]
# print the matrix V (rounded to 3dp)
round(major_minerals_log_v, 3)
                 PC1    PC2    PC3    PC4    PC5    PC6
sodium        -0.375 -0.600 -0.439  0.522  0.120 -0.144
potassium     -0.593  0.200  0.496  0.341 -0.495  0.025
calcium       -0.321  0.573 -0.698 -0.153 -0.196 -0.141
phosphorus    -0.414 -0.116 -0.063 -0.308  0.215  0.818
magnesium     -0.379  0.264  0.252 -0.032  0.774 -0.351
total_choline -0.298 -0.435  0.083 -0.702 -0.239 -0.407

To create the PCA-transformed dataset, we need to multiply the original data by this matrix \(V\)

# get a matrix version of the original major minerals dataset (X)
major_minerals_log_x <- food_fndds_log_scaled_major_minerals |>
  select(-description) |>
  as.matrix()
# compute the PCA transformation
pca_major_minerals_log_x <- major_minerals_log_x %*% major_minerals_log_v

# add the description 
pca_major_minerals_log_x <- pca_major_minerals_log_x |>
  as_tibble() |>
  mutate(description = food_fndds_log_scaled_major_minerals$description, .before = "PC1")
pca_major_minerals_log_x
# A tibble: 8,690 × 7
   description                             PC1   PC2    PC3    PC4    PC5    PC6
   <chr>                                 <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Milk, human                           -6.49 0.379 -0.260 -0.580 -1.29  -0.258
 2 Milk, NFS                             -8.84 0.988 -0.513 -0.572 -0.633  0.377
 3 Milk, whole                           -8.68 0.986 -0.534 -0.520 -0.666  0.399
 4 Milk, low sodium, whole               -8.24 1.79   0.395 -1.14  -1.74   0.880
 5 Milk, calcium fortified, whole        -8.68 0.986 -0.534 -0.520 -0.666  0.399
 6 Milk, calcium fortified, low fat (1%) -8.93 0.986 -0.497 -0.642 -0.691  0.366
 7 Milk, calcium fortified, fat free (s… -9.16 1.20  -0.779 -0.566 -0.777  0.375
 8 Milk, reduced fat (2%)                -8.86 0.959 -0.532 -0.586 -0.633  0.366
 9 Milk, acidophilus, low fat (1%)       -8.93 0.986 -0.497 -0.642 -0.691  0.366
10 Milk, acidophilus, reduced fat (2%)   -8.86 0.959 -0.532 -0.586 -0.633  0.366
# ℹ 8,680 more rows

Let’s extract the principal components for the “Herring, smoked, kippered” food item.

pca_major_minerals_log_x |> 
  filter(description == "Herring, smoked, kippered")
# A tibble: 1 × 7
  description                 PC1    PC2    PC3    PC4   PC5    PC6
  <chr>                     <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>
1 Herring, smoked, kippered -11.7 -0.507 -0.109 -0.579 0.132 -0.136

Next, we can look at a sample of 500 food items in PC1-PC2 space.

set.seed(23178989)
pca_major_minerals_log_x |>
  sample_n(500) |>
  ggplot() +
  geom_point(aes(x = PC2, y = PC1), alpha = 0.2, col = "grey50") +
  geom_text(aes(x = PC2, y = PC1, 
                # show just the first word of the description
                label = map_chr(description, ~str_split(., " ", simplify = TRUE)[1])), 
            check_overlap = TRUE, alpha = 0.9, nudge_x = 0.5, hjust = 0)

The food items seem much more spread out in the PC1-PC2 space based on this log-transformed dataset.

We can compute the proportion of variability explained by each component:

(major_minerals_log_pca$d^2) / sum(major_minerals_log_pca$d^2)
[1] 0.972004576 0.008478704 0.007926444 0.006051239 0.003742752 0.001796285

The first component now explains almost 100% of the variability in the original major minerals dataset.

Overall, this log-transformed version of PC1 does seem to be a much better summary variable than the version of PC1 computed on the untransformed data.

3.5 [Exercise: to complete] Compute the correlation between the principal component variables and each of the original variables

For the major minerals dataset, compute the correlation between each original major mineral nutrient variable and the principal components you computed. Compare these correlations with the variable loadings. You may do this here, and/or based on the log-transformed PCA results below.

To get you started, here is one way to compute the correlation between PC1 and the original sodium variable.

cor(pca_major_minerals_log_x$PC1, food_fndds_log_scaled_major_minerals$sodium)
[1] -0.6645283

3.6 PCS assessment

Predictability

Let’s look at the results of these principal components on the (scaled and log-transformed) Legacy dataset.

# create the legacy (log, scaled) major minerals dataset
food_legacy_log_scaled_major_minerals <- food_legacy_log_scaled |>
  select(description, sodium, potassium, calcium, 
         phosphorus, magnesium, total_choline)
# convert it to a numeric matrix
legacy_major_minerals_x <- food_legacy_log_scaled_major_minerals |>
  select(-description) |>
  as.matrix()
# create the PC components for the legacy major minerals dataset
pca_legacy_major_minerals_log_x <- legacy_major_minerals_x %*% major_minerals_log_v

# add the food descriptions
pca_legacy_major_minerals_log_x <- pca_legacy_major_minerals_log_x |>
  as_tibble() |>
  mutate(description = food_legacy_log_scaled_major_minerals$description, .before = "PC1")
pca_legacy_major_minerals_log_x
# A tibble: 7,793 × 7
   description                             PC1    PC2   PC3    PC4     PC5   PC6
   <chr>                                 <dbl>  <dbl> <dbl>  <dbl>   <dbl> <dbl>
 1 Pork, fresh, composite of trimmed le… -7.76 -0.182 0.888 -0.614 -0.0239 0.544
 2 Pork, fresh, backfat, raw             -4.96 -0.609 0.858 -0.533 -0.852  0.724
 3 Pork, fresh, belly, raw               -6.41 -0.553 0.792 -0.450 -0.839  0.927
 4 Pork, fresh, separable fat, raw       -7.11 -0.202 0.596 -0.402 -1.04   0.458
 5 Pork, fresh, leg (ham), rump half, s… -8.05 -0.453 0.861 -0.901 -0.205  0.386
 6 Pork, fresh, leg (ham), rump half, s… -8.39 -0.387 0.839 -1.06  -0.172  0.298
 7 Pork, fresh, leg (ham), rump half, s… -8.14 -0.473 0.877 -0.948 -0.139  0.380
 8 Pork, fresh, leg (ham), rump half, s… -8.47 -0.405 0.858 -1.10  -0.125  0.291
 9 Pork, fresh, loin, whole, separable … -8.05 -0.155 0.775 -1.03  -0.263  0.307
10 Pork, fresh, loin, whole, separable … -8.11 -0.185 0.731 -1.18  -0.442  0.190
# ℹ 7,783 more rows

Then we can look at a random sample of 500 legacy food items in the PC1 and PC2 space:

set.seed(169)
pca_legacy_major_minerals_log_x |>
  sample_n(500) |>
  ggplot() +
  geom_point(aes(x = PC2, y = -PC1), alpha = 0.5, color = "grey50") +
  geom_text(aes(x = PC2, y = -PC1, 
                # show just the first word of the description
                label = map_chr(description, ~str_split(., " ", simplify = TRUE)[1])), 
            check_overlap = TRUE, alpha = 0.9, nudge_x = 0.2, hjust = 0)

(Note that we are showing the negative of PC1 in the book, so we negate it in the plot above for comparison purposes.)

Overall it looks as though similar legacy food items are again being placed close together in PC1 and PC2 space, and the distribution of food items in PC1-PC2 space is very similar.

Stability

Stability to data perturbations

Let’s investigate the stability of our principal component analysis results under the two following types of data perturbations:

  1. Bootstrapping: Since it is plausible that a different set of food items may have been included in the data, and the food items are likely to be exchangeable, we will use a bootstrap sampling perturbation to reflect the sampling uncertainty.

  2. Measurement noise: Since it is plausible that the individual nutrient measurements in the data may have been measured or computed slightly differently, we will add a random number to each measurement that comes from a Normal distribution with mean 0 and standard deviation equal to 0.2 multiplied by the mean of the relevant column. This is motivated by the fact that the FDA has a 20% allowable margin of error for nutrition information.

Below, we compute four such perturbed versions of the FNDDS dataset.

set.seed(98634)
food_fndds_major_minerals_perturb <- map(1:4, function(iter) {
  # conduct the perturbations
  food_fndds_log_scaled_major_minerals |>
    # add random noise
    # add random noise to each observation (noise is equal to 0 on average)
    mutate(across(where(is.numeric), 
                  function(x) x + rnorm(n(), 0, 0.2 * mean(x))))  |>
    # if any observations are negative, round them up to 0
    mutate(across(where(is.numeric), 
                  function(x) if_else(x < 0, true = 0, false = x))) |>
    # take a bootstrap sample
    sample_frac(1, replace = TRUE)
})

Let’s take a look at how the sodium and magnesium values have been perturbed. Below we print the perturbed sodium and magnesium values alongside the original sodium and magnesium values for 10 randomly selected food items for the first perturbed dataset. The values don’t look too aggressively different and feel reasonable.

set.seed(4678)
food_fndds_major_minerals_perturb[[1]] |>
  # select just the sodium and magnesium variables and rename them
  select(description, 
         sodium_perturbed = sodium,
         magnesium_perturbed = magnesium) |>
  # take a random sample of 10 food items
  sample_n(10) |>
  # add the original values to the dataset
  left_join(select(food_fndds_log_scaled_major_minerals, 
                   description, sodium, magnesium),
            by = "description") %>%
  select(description, sodium, sodium_perturbed, magnesium, magnesium_perturbed)
# A tibble: 10 × 5
   description             sodium sodium_perturbed magnesium magnesium_perturbed
   <chr>                    <dbl>            <dbl>     <dbl>               <dbl>
 1 Pad Thai with seafood     3.86             3.50      3.86                4.50
 2 Muffin, pumpkin           3.73             4.27      3.28                2.87
 3 Peas, cooked, from fre…   3.70             2.86      3.78                3.54
 4 Croaker, coated, baked…   3.86             4.41      4.41                5.95
 5 Bacon cheeseburger, 1 …   4.21             3.68      3.78                3.13
 6 Peas, green, cooked, f…   3.78             2.94      3.64                2.48
 7 Infant formula, liquid…   2.09             2.37      2.08                2.43
 8 Cereal (Post Shredded …   3.05             3.60      5.27                5.58
 9 Squash, summer, yellow…   3.44             3.55      3.06                2.43
10 Oatmeal, regular or qu…   3.18             3.05      4.05                4.01

Nest, let’s apply SVD to each perturbed dataset, and extract the variable loadings for each PC from each perturbed dataset.

major_minerals_perturbed_data_loadings_df <- food_fndds_major_minerals_perturb |>
  map_df(function(data_perturb) {
    
    # apply SVD to the perturbed dataset
    pca_perturb <- data_perturb |>
      select(-description) |>
      as.matrix() |>
      svd()
    
    # extract the loadings (matrix V)
    loadings_perturb <- pca_perturb$v
    rownames(loadings_perturb) <- colnames(data_perturb)[-1]
    colnames(loadings_perturb) <- paste0("PC", 1:ncol(loadings_perturb))
    
    
    # put results in data frame
    loadings_perturb <- loadings_perturb |>
      as.data.frame() |>
      rownames_to_column("nutrient")
    
    return(loadings_perturb)
  }, .id = "iter")
major_minerals_perturbed_data_loadings_df
   iter      nutrient        PC1         PC2          PC3         PC4
1     1        sodium -0.3718561  0.47084555 -0.622289851 -0.43334310
2     1     potassium -0.5990483 -0.75759884 -0.182326817 -0.12398457
3     1       calcium -0.3188059  0.23470537  0.634911356 -0.56263723
4     1    phosphorus -0.4122683  0.30742742  0.121645691  0.38873649
5     1     magnesium -0.3783931  0.02120981  0.364943145  0.27978194
6     1 total_choline -0.2967836  0.23302443 -0.168578376  0.50088615
7     2        sodium -0.3743060  0.43698495  0.699071399 -0.33839411
8     2     potassium -0.5946492 -0.76583830  0.171918384 -0.08629582
9     2       calcium -0.3197481  0.22247699 -0.570537900 -0.67774814
10    2    phosphorus -0.4135598  0.33696266 -0.163368749  0.34268303
11    2     magnesium -0.3809536  0.02469255 -0.353132495  0.31332542
12    2 total_choline -0.2964644  0.24266497  0.069554628  0.45066111
13    3        sodium -0.3734758  0.43745116 -0.515354388 -0.56504948
14    3     potassium -0.5956553 -0.75695516 -0.126317341 -0.16425013
15    3       calcium -0.3203189  0.26031762  0.757699921 -0.40065689
16    3    phosphorus -0.4133453  0.33027164 -0.001317355  0.37869145
17    3     magnesium -0.3804746 -0.01066089  0.272763199  0.39658028
18    3 total_choline -0.2957891  0.24227484 -0.264467608  0.43878457
19    4        sodium -0.3747221 -0.40198004 -0.663883901  0.44742868
20    4     potassium -0.5959236  0.77379732 -0.091968332  0.13398583
21    4       calcium -0.3194475 -0.31164526  0.626952504  0.53624394
22    4    phosphorus -0.4151579 -0.31389840  0.051217461 -0.39654064
23    4     magnesium -0.3790202 -0.05593685  0.352562234 -0.27346692
24    4 total_choline -0.2939377 -0.20215301 -0.175519778 -0.51212223
          PC5           PC6
1  -0.2549434  0.0007057213
2   0.1302541 -0.0400470694
3   0.3055529  0.1738980509
4   0.2094129 -0.7250901541
5  -0.7746752  0.2116263455
6   0.4250879  0.6305662220
7   0.2554347 -0.0219287203
8  -0.1485148 -0.0287135678
9  -0.1695941  0.1861369245
10 -0.3393716 -0.6753742596
11  0.7890815  0.0934857007
12 -0.3822454  0.7065251169
13 -0.2816917  0.0702214301
14  0.1590974 -0.0629882437
15  0.2790460  0.1308767864
16  0.1166517 -0.7503664182
17 -0.7342400  0.2904157712
18  0.5145436  0.5714731988
19  0.2387897  0.0063752998
20 -0.1387126 -0.0214763889
21 -0.2926091  0.1859662559
22 -0.2757503 -0.7022861383
23  0.8005894  0.1148350182
24 -0.3480469  0.6771423561

Then we can visualize the four perturbed loadings using a bar chart, and notice that they are almost identical for each variable, indicating that our principal components are very stable to these data perturbations.

major_minerals_perturbed_data_loadings_df |>
  ggplot() +
  geom_col(aes(x = nutrient, y = -PC1, group = iter), 
           position = "dodge", col = "white")

Stability to cleaning and pre-processing judgment calls

Next, let’s see if it makes a difference if we choose to center the variables. Note that for this particular perturbation, we feel that it makes more sense to keep the variables uncentered so that they (and the principal components) are interpreted relative to 0, rather than relative to the “average” data point (food item). But for curiosity’s sake we will conduct a stability investigation anyway (and while we’re at it we will also compare the log-transformed vs untransformed components too).

perturb_options <- expand_grid(center = c(TRUE, FALSE),
                               log = c(TRUE, FALSE)) %>%
  mutate(center_log_option = as.character(1:n()))


food_fndds_major_minerals <- food_fndds %>%
  select(description, sodium, potassium, calcium, 
         phosphorus, magnesium, total_choline)
# create a data frame consisting of all of the loadings for each nutrient 
# group for each judgment call perturbation
major_minerals_perturbed_jc_loadings_df <- map_df(1:nrow(perturb_options), function(i) {
  # pre-process dataset using the current judgment call i
  data_perturb <- preProcessFoodData(food_fndds_major_minerals,
                                     .log_transform = perturb_options$log[i],
                                     .center = perturb_options$center[i],
                                     .scale = TRUE)
  pca_perturb <- data_perturb |>
    select(-description) |>
    as.matrix() |>
    svd()
  
  # extract the loadings (matrix V)
  loadings_perturb <- pca_perturb$v
  rownames(loadings_perturb) <- colnames(data_perturb)[-1]
  colnames(loadings_perturb) <- paste0("PC", 1:ncol(loadings_perturb))
  
  # put results in data frame
  loadings_perturb <- loadings_perturb |>
    as.data.frame() |>
    rownames_to_column("nutrient")
  
  return(loadings_perturb)
}, .id = "center_log_option")

major_minerals_perturbed_jc_loadings_df
   center_log_option      nutrient        PC1         PC2         PC3
1                  1        sodium  0.3454451 -0.57033898 -0.54321217
2                  1     potassium  0.4278605  0.10999523  0.47206943
3                  1       calcium  0.3231232  0.66938294 -0.60341806
4                  1    phosphorus  0.4826523 -0.06048367 -0.01870576
5                  1     magnesium  0.4347524  0.26889086  0.32073040
6                  1 total_choline  0.4138014 -0.37226390  0.12140733
7                  2        sodium  0.2784403  0.62651484 -0.20812114
8                  2     potassium  0.4319729 -0.29131868  0.41548151
9                  2       calcium  0.3226231 -0.15953032 -0.78631288
10                 2    phosphorus  0.5521171  0.05222188 -0.10560708
11                 2     magnesium  0.4625976 -0.44918129  0.14665625
12                 2 total_choline  0.3360877  0.54099141  0.36484498
13                 3        sodium -0.3751096 -0.59983681 -0.43870055
14                 3     potassium -0.5932373  0.19964914  0.49606470
15                 3       calcium -0.3210134  0.57290814 -0.69790115
16                 3    phosphorus -0.4143167 -0.11566399 -0.06291459
17                 3     magnesium -0.3794694  0.26374037  0.25207738
18                 3 total_choline -0.2977539 -0.43494268  0.08303235
19                 4        sodium -0.4243779 -0.63277519 -0.22628567
20                 4     potassium -0.4842255  0.27874194  0.42899241
21                 4       calcium -0.2901856  0.23764046 -0.78542584
22                 4    phosphorus -0.4858476  0.06462321 -0.12090874
23                 4     magnesium -0.3767640  0.50805133  0.15763364
24                 4 total_choline -0.3510304 -0.45070300  0.32924043
           PC4         PC5         PC6
1   0.48515344 -0.09278932 -0.12774191
2   0.23832909 -0.71953911  0.08631132
3  -0.18684276 -0.17873614 -0.12863792
4  -0.16620780  0.30650266  0.80091800
5   0.37799533  0.58950185 -0.38134911
6  -0.70880863 -0.01583331 -0.41567759
7   0.68624617  0.04425682 -0.11723186
8   0.27380723 -0.68953728  0.07399960
9  -0.25003345 -0.33360930 -0.27993674
10 -0.19140097  0.26550220  0.75773410
11  0.15250242  0.58016000 -0.45043197
12 -0.57592339 -0.06486507 -0.35407096
13  0.52150932  0.12009181 -0.14365385
14  0.34102952 -0.49520198  0.02456326
15 -0.15283675 -0.19583459 -0.14124583
16 -0.30806520  0.21512458  0.81842686
17 -0.03240032  0.77371663 -0.35101825
18 -0.70172208 -0.23892626 -0.40715292
19  0.57922288 -0.15101672  0.09994266
20  0.30035241  0.63512049 -0.10101747
21 -0.21539018  0.34997349  0.27120325
22 -0.28134521 -0.30448948 -0.75715788
23  0.10645143 -0.59399215  0.45926655
24 -0.66136928  0.07611802  0.34934499

Next, we can look at some bar charts of the loadings for each variables

major_minerals_perturbed_jc_loadings_df |>
  left_join(perturb_options, by = "center_log_option") |>
  ggplot() +
  geom_col(aes(x = nutrient, y = PC1, fill = center), 
           position = "dodge", col = "white") +
  facet_wrap(~paste("log: ", log), ncol = 1)

Notice that when we do not center the coefficients, the PC1 loadings are negative, but when we do, they are positive. Note, however that PC1 and its negation -PC1 are equivalent variables, so to make this comparison fair, let’s look instead at the absolute value of the PC1 loadings. The loadings now look much more similar, but definitely less stable than for the data perturbations. Note that since the log-transformed loadings may not be comparable to the untransformed loadings, we are visualizing them in separate facet panels.

major_minerals_perturbed_jc_loadings_df |>
  left_join(perturb_options, by = "center_log_option") |>
  ggplot() +
  geom_col(aes(x = nutrient, y = abs(PC1), fill = center), 
           position = "dodge", col = "white") +
  facet_wrap(~paste("log: ", log), ncol = 1)

4 PCA applied simultaneously to all nutrient groups

Let’s now apply this same analysis to all of the other nutrient groups. In order to minimize repetitious code, the code below makes heavy use of functions and “nested” tibbles/data frames that will help us to automate applying the same analysis simultaneously to each of the different nutrient groups. The code in Section 4 is quite advanced, so if you prefer, you can just take the code above (for the major minerals nutrient group), and modify it so that it is just applied to each individual nutrient group separately.

First, we wrote a function that will produce a list of dataset for each nutrient group. This function is helpful because it will be simple to apply this same split to each of the different datasets for predictability and stability analysis later on.

# A function for creating the separate nutrient group datasets
createNutrientGroups <- function(.food_data) {
  
  food_fats <- .food_data |> 
    select(contains(c("_acid", "fat", "cholesterol"))) 
  
  food_vitamins <- .food_data |>
    select(beta_carotene, alpha_carotene, lutein_zeaxanthin, phylloquinone, vitamin_c, riboflavin, thiamine, folate, niacin, vitamin_b6, alpha_tocopherol, retinol, vitamin_b12, lycopene, cryptoxanthin)
  
  food_major_minerals <- .food_data |>
    select(sodium, potassium, calcium, phosphorus, magnesium, total_choline)
  
  food_trace_minerals <- .food_data |>
    select(iron, zinc, selenium, copper)
  
  food_carbs <- .food_data |>
    select(total_dietary_fiber, carbohydrates)
  
  # return a tibble with list columns: each entry corresponds to one of the 
  # nutrient datasets
  return(tibble(nutrient_group = c("fats", "vitamins", "major_minerals", 
                                   "trace_minerals", "carbs"),
                data = list(food_fats, food_vitamins, food_major_minerals, 
                            food_trace_minerals, food_carbs),
                description = list(.food_data$description,
                                   .food_data$description,
                                   .food_data$description,
                                   .food_data$description,
                                   .food_data$description)))
}

Then we can create a tibble, whose rows correspond to each nutrient-group dataset.

# put all of the food sub-datasets into a list column for efficient computing
food_grouped_fndds_log_scaled <- createNutrientGroups(food_fndds_log_scaled)
food_grouped_fndds_log_scaled
# A tibble: 5 × 3
  nutrient_group data                  description  
  <chr>          <list>                <list>       
1 fats           <tibble [8,690 × 24]> <chr [8,690]>
2 vitamins       <tibble [8,690 × 15]> <chr [8,690]>
3 major_minerals <tibble [8,690 × 6]>  <chr [8,690]>
4 trace_minerals <tibble [8,690 × 4]>  <chr [8,690]>
5 carbs          <tibble [8,690 × 2]>  <chr [8,690]>

To access one of the datasets, say the trace_minerals dataset, we can extract that entry of the tibble:

food_grouped_fndds_log_scaled |>
  filter(nutrient_group == "trace_minerals") |>
  # pluck the first entry from the data column
  pluck("data", 1)
# A tibble: 8,690 × 4
     iron  zinc selenium copper
    <dbl> <dbl>    <dbl>  <dbl>
 1 0.0536 0.318    0.856 0.343 
 2 0.0536 0.711    1.17  0.0941
 3 0.0536 0.638    1.29  0.167 
 4 0.0885 0.653    0.913 0.0673
 5 0.0536 0.638    1.29  0.167 
 6 0.0536 0.711    1.21  0.0673
 7 0.0712 0.682    0.941 0.0740
 8 0.0359 0.795    1.04  0.0405
 9 0.0536 0.711    1.21  0.0673
10 0.0359 0.795    1.04  0.0405
# ℹ 8,680 more rows

Notice that we use pluck() instead of select(), because select() returns a data frame/tibble consisting of the named column(s), whereas pluck() returns the contents of the named columns (i.e., it plucks the contents from the tibble format).

Then we can use a mutate() function to apply PCA via svd() to each of these datasets simultaneously. We will wrap this in a function so that we can re-use it for our downstream stability analysis.

Note that you could do this with a for loop or a map function, but the objects you create gets messy quickly with these approaches unless you are very careful.

Note that when you want to use mutate() to apply a function to a list column (such as data), you need to use a map() function.

food_grouped_fndds_log_scaled <- food_grouped_fndds_log_scaled |>
  # rowwise lets us apply a function to each entry in a row separately
  rowwise() |>
  # apply PCA to each dataset and save it in a list-column called "pca"
  mutate(pca = list(svd(data)))

This is what our object looks like now (it has a new pca column)

food_grouped_fndds_log_scaled
# A tibble: 5 × 4
# Rowwise: 
  nutrient_group data                  description   pca             
  <chr>          <list>                <list>        <list>          
1 fats           <tibble [8,690 × 24]> <chr [8,690]> <named list [3]>
2 vitamins       <tibble [8,690 × 15]> <chr [8,690]> <named list [3]>
3 major_minerals <tibble [8,690 × 6]>  <chr [8,690]> <named list [3]>
4 trace_minerals <tibble [8,690 × 4]>  <chr [8,690]> <named list [3]>
5 carbs          <tibble [8,690 × 2]>  <chr [8,690]> <named list [3]>

Next, we can extract the loadings from each PCA object, and save them in their own list column.

food_grouped_fndds_log_scaled <- food_grouped_fndds_log_scaled |>
  # Note that rowwise should still be in effect.
  # apply PCA to each dataset and save it in a list-column called "pca"
  mutate(loadings = list(pca$v))

food_grouped_fndds_log_scaled |>
  filter(nutrient_group == "carbs") |>
  pluck("loadings", 1)
           [,1]       [,2]
[1,] -0.5308444  0.8474693
[2,] -0.8474693 -0.5308444

We can also add a column consisting of the PCA transformed dataset for each nutrient group. This one looks a little more complicated, because we are using a map2() function and the function we are applying itself is a little more complicated.

food_grouped_fndds_log_scaled <- food_grouped_fndds_log_scaled |>
  mutate(pc_x = list(set_names(as_tibble(as.matrix(data) %*% loadings),
                               paste0("PC", 1:ncol(loadings)))))

Our object now looks like

food_grouped_fndds_log_scaled
# A tibble: 5 × 6
# Rowwise: 
  nutrient_group data                 description pca          loadings pc_x    
  <chr>          <list>               <list>      <list>       <list>   <list>  
1 fats           <tibble>             <chr>       <named list> <dbl[…]> <tibble>
2 vitamins       <tibble>             <chr>       <named list> <dbl[…]> <tibble>
3 major_minerals <tibble [8,690 × 6]> <chr>       <named list> <dbl[…]> <tibble>
4 trace_minerals <tibble [8,690 × 4]> <chr>       <named list> <dbl[…]> <tibble>
5 carbs          <tibble [8,690 × 2]> <chr>       <named list> <dbl[…]> <tibble>

And we can access the PC-transformed dataset for the trace minerals nutrient group using the following code

food_grouped_fndds_log_scaled |>
  filter(nutrient_group == "trace_minerals") |>
  pluck("pc_x", 1) 
# A tibble: 8,690 × 4
      PC1   PC2    PC3     PC4
    <dbl> <dbl>  <dbl>   <dbl>
 1 -0.802 0.202 -0.511  0.0951
 2 -1.11  0.572 -0.552 -0.199 
 3 -1.16  0.582 -0.638 -0.0832
 4 -0.935 0.442 -0.392 -0.226 
 5 -1.16  0.582 -0.638 -0.0832
 6 -1.12  0.616 -0.560 -0.192 
 7 -0.958 0.453 -0.421 -0.247 
 8 -1.04  0.537 -0.481 -0.331 
 9 -1.12  0.616 -0.560 -0.192 
10 -1.04  0.537 -0.481 -0.331 
# ℹ 8,680 more rows

4.1 Compute the proportion of variance explained

Let’s start using this object to do interesting things. Let’s compute the vector of proportion of variance explained for each PC for each nutrient group as follows:

computePropExplained <- function(.nutrient_group) {
  pca_nutrient_group <- food_grouped_fndds_log_scaled |>
    filter(nutrient_group == .nutrient_group) |>
    pluck("pca", 1)
  prop_var <- pca_nutrient_group$d^2 / sum(pca_nutrient_group$d^2)
  return(prop_var)
}

4.2 Major minerals PCA

Let’s look at the major minerals sub-dataset (recall this is based on the log-transformed, scaled, uncentered dataset). First let’s look at the proportion of variability explained by each of the PCs.

prop_var_major_minerals <- computePropExplained("major_minerals")
prop_var_major_minerals
[1] 0.972004576 0.008478704 0.007926444 0.006051239 0.003742752 0.001796285

Since there are columns in the major minerals dataset, there are 6 principal components.

In the table below, we see that the first PC explains 97% of the variability in the major minerals data.

tibble(PC = 1:length(prop_var_major_minerals),
       `Variability explained` = round(prop_var_major_minerals, 2),
       `Cumulative variability explained` = round(cumsum(prop_var_major_minerals), 2))
# A tibble: 6 × 3
     PC `Variability explained` `Cumulative variability explained`
  <int>                   <dbl>                              <dbl>
1     1                    0.97                               0.97
2     2                    0.01                               0.98
3     3                    0.01                               0.99
4     4                    0.01                               0.99
5     5                    0                                  1   
6     6                    0                                  1   

The nutrient variable loadings are shown in the table below in decreasing order of loading magnitude (in absolute value).

pc_loadings <- food_grouped_fndds_log_scaled |>
  filter(nutrient_group == "major_minerals") |>
  pluck("loadings", 1)
major_mineral_nutrient_name <- food_grouped_fndds_log_scaled |>
  filter(nutrient_group == "major_minerals") |>
  pluck("data", 1) |>
  colnames()

tibble(nutrient = major_mineral_nutrient_name,
       PC1_loading = pc_loadings[, 1]) |>
  arrange(desc(abs(PC1_loading)))
# A tibble: 6 × 2
  nutrient      PC1_loading
  <chr>               <dbl>
1 potassium          -0.593
2 phosphorus         -0.414
3 magnesium          -0.379
4 sodium             -0.375
5 calcium            -0.321
6 total_choline      -0.298

Potassium is the nutrient with the largest weight on the first PC, but all variables have a reasonable loading/weight associated with them, indicating that the first major mineral PC contains a reasonable amount of information from all of the major mineral nutrients.

Note that all of the weights are negative, which is a little bit counter-intuitive. A perfectly equivalent PC variable is the inverse version, where all of the values are negated. This is what we will use in our summaries.

Since the first PC explains so much of the variability in the data, we won’t use PC2, PC3 in our summary, but these later variables can still be very helpful for understanding the distribution of the foods in PC space.

The figure below shows that the major minerals PC1 (and PC2) does a good job of separating the subset of foods that we see in a meaningful way. For instance, several seafood and fish food items seem to have very large PC1 (y-axis) measurements (indicating that they contain a lot of major minerals), and meat items such as beef and pork as well as dairy items such as milk and cheese also have fairly large values (but not quite as high as the fish items).

plotPc1Pc2 <- function(.food_grouped, 
                       .nutrient_group, 
                       .sample_n = NULL,
                       .description_contains = NULL) {
  
  # extract the food names
  description <- .food_grouped |>
    filter(nutrient_group == .nutrient_group) |>
    pluck("description", 1) 
  
  # extract the relevant dataset from the list column
  x_df <- .food_grouped |>
    filter(nutrient_group == .nutrient_group) |>
    pluck("pc_x", 1) |>
    mutate(description = description)
  
  # filter to the food items whose descriptions contains whatever character 
  # string is provided in "description_contains"
  if (!is.null(.description_contains)) {
    x_filtered <- x_df |>
      filter(str_detect(tolower(description), tolower(.description_contains)))
  } else {
    x_filtered <- x_df
  }
  
  # create the plot
  x_df |>
    ggplot() +
    geom_point(aes(x = PC2, y = -PC1), alpha = 0.3, color = "grey50") +
    geom_text(aes(x = PC2, y = -PC1, label = str_trunc(description, 15)), 
              check_overlap = TRUE, hjust = 0, data = x_filtered)  +
    ggtitle(.nutrient_group)
}
plotPc1Pc2(food_grouped_fndds_log_scaled,
           "major_minerals")

Figure 2: A scatterplot of PC1 vs PC2 for the major mineral nutrients.

4.3 Vitamins PCA

prop_var_vitamins <- computePropExplained("vitamins")
prop_var_vitamins
 [1] 0.645684768 0.116733858 0.038612719 0.032097791 0.029051530 0.025352093
 [7] 0.024434816 0.019865259 0.015640180 0.014241356 0.010848861 0.008282093
[13] 0.007410131 0.006329313 0.005415230

Since there are columns in the vitamins dataset, there are principal components.

In the table below (the variability explained by the first 5 PCs), we see that the first PC explains 65% of the variability in the vitamins data. The first two variables explain 76%.

tibble(PC = 1:length(prop_var_vitamins),
       `Variability explained` = round(prop_var_vitamins, 2),
       `Cumulative variability explained` = round(cumsum(prop_var_vitamins), 2)) |>
  head(5)
# A tibble: 5 × 3
     PC `Variability explained` `Cumulative variability explained`
  <int>                   <dbl>                              <dbl>
1     1                    0.65                               0.65
2     2                    0.12                               0.76
3     3                    0.04                               0.8 
4     4                    0.03                               0.83
5     5                    0.03                               0.86

The nutrient variable loadings are shown in the table below in decreasing order of loading magnitude (in absolute value).

pc_loadings <- food_grouped_fndds_log_scaled |>
  filter(nutrient_group == "vitamins") |>
  pluck("loadings", 1)
vitamins_nutrient_name <- food_grouped_fndds_log_scaled |>
  filter(nutrient_group == "vitamins") |>
  pluck("data", 1) |>
  colnames()

tibble(nutrient = vitamins_nutrient_name,
       PC1_loading = pc_loadings[, 1]) |>
  arrange(desc(abs(PC1_loading)))
# A tibble: 15 × 2
   nutrient          PC1_loading
   <chr>                   <dbl>
 1 folate                -0.530 
 2 niacin                -0.308 
 3 phylloquinone         -0.300 
 4 lutein_zeaxanthin     -0.285 
 5 beta_carotene         -0.267 
 6 alpha_tocopherol      -0.265 
 7 retinol               -0.244 
 8 vitamin_c             -0.230 
 9 riboflavin            -0.227 
10 vitamin_b6            -0.211 
11 thiamine              -0.203 
12 vitamin_b12           -0.151 
13 alpha_carotene        -0.128 
14 cryptoxanthin         -0.125 
15 lycopene              -0.0922

Folic acid is the nutrient with the largest weight on the first PC. Note that all of the weights are negative again, so we will negate the values of this PC in our summaries and downstream analyses.

The figure below shows that the vitamin PC1 (and PC2) also does a good job of separating the subset of foods that we see in a meaningful way. The foods with the highest values for vitamins seem to mostly be cereals, followed by liver meats.

plotPc1Pc2(food_grouped_fndds_log_scaled, "vitamins")

Figure 3: A scatterplot of PC1 vs PC2 for the vitamins nutrients.

4.4 Fats PCA

prop_var_fats <- computePropExplained("fats")
prop_var_fats
 [1] 6.269467e-01 1.167937e-01 9.562445e-02 4.025333e-02 2.545646e-02
 [6] 2.005710e-02 1.357243e-02 1.174056e-02 9.219004e-03 8.539094e-03
[11] 7.194402e-03 6.432447e-03 4.362469e-03 4.139721e-03 2.967874e-03
[16] 2.071057e-03 1.284908e-03 1.106013e-03 9.037673e-04 7.759569e-04
[21] 3.103251e-04 1.778018e-04 5.124429e-05 1.917549e-05

Since there are columns in the fats dataset, there are principal components. In the table below

In the table below (the variability explained by the first 5 PCs), we see that the first PC explains 63% of the variability in the fats data. The first two variables explain 74%.

tibble(PC = 1:length(prop_var_fats),
       `Variability explained` = round(prop_var_fats, 2),
       `Cumulative variability explained` = round(cumsum(prop_var_fats), 2)) |>
  head(5)
# A tibble: 5 × 3
     PC `Variability explained` `Cumulative variability explained`
  <int>                   <dbl>                              <dbl>
1     1                    0.63                               0.63
2     2                    0.12                               0.74
3     3                    0.1                                0.84
4     4                    0.04                               0.88
5     5                    0.03                               0.91

The nutrient variable loadings are shown in the table below in decreasing order of loading magnitude (in absolute value).

pc_loadings <- food_grouped_fndds_log_scaled |>
  filter(nutrient_group == "fats") |>
  pluck("loadings", 1)
fats_nutrient_name <- food_grouped_fndds_log_scaled |>
  filter(nutrient_group == "fats") |>
  pluck("data", 1) |>
  colnames()

tibble(nutrient = fats_nutrient_name,
       PC1_loading = pc_loadings[, 1]) |>
  arrange(desc(abs(PC1_loading)))
# A tibble: 24 × 2
   nutrient             PC1_loading
   <chr>                      <dbl>
 1 fat                       -0.386
 2 saturated_fat             -0.318
 3 monounsaturated_fat       -0.317
 4 oleic_acid                -0.312
 5 palmitic_acid             -0.306
 6 polyunsaturated_fat       -0.279
 7 linoleic_acid             -0.262
 8 stearic_acid              -0.256
 9 cholesterol               -0.240
10 alpha_linolenic_acid      -0.178
# ℹ 14 more rows

“Fat” is the nutrient with the largest weight on the first PC, followed by saturated and then monounsaturated fat. Note that all of the weights are negative again, so we will negate the values of this PC in our summaries and downstream analyses.

The figure below shows that the fat PC1 (and PC2) also does a good job of separating the subset of foods that we see in a meaningful way. The foods with the highest values for fats seem to mostly be ghee, butter, etc (which is unsurprising).

plotPc1Pc2(food_grouped_fndds_log_scaled, "fats")

Figure 4: A scatterplot of PC1 vs PC2 for the fats nutrients.

PC2 seems to be picking up on fish-specific fats (likely “healthy fatty acids”). We may want to consider using this variable in our summaries, but for now we will keep it simple and stick with PC1 only (one avenue of analysis might be to identify which fat nutrients are being captured by PC2 and creating an entirely new nutrient group from them).

Notice that many milk products seem to have very low fat content, which is quite surprising. Closer inspection, however, shows that this is mostly fat free milk or “NS [not sure] as to fat content” (which were likely fat free or low fat).

set.seed(4662)
food_grouped_fndds_log_scaled |> 
  filter(nutrient_group == "fats") |>
  # extract PC transformed data for fats nutrient group
  pluck("pc_x", 1) |> 
  # add a description column
  mutate(description = food_fndds$description) |>
  # filter to milk products with fat PC variable < 3
  filter(abs(PC1) < 3, 
         str_detect(description, "Milk")) |>
  select(description) |>
  sample_n(10)
# A tibble: 10 × 1
   description                                       
   <chr>                                             
 1 Milk, calcium fortified, fat free (skim)          
 2 Milk, dry, not reconstituted, fat free (skim)     
 3 Milk, dry, not reconstituted, NS as to fat content
 4 Milk, dry, not reconstituted, low fat (1%)        
 5 Milk, dry, reconstituted, NS as to fat content    
 6 Milk, lactose free, low fat (1%)                  
 7 Milk, calcium fortified, low fat (1%)             
 8 Milk dessert bar, frozen, made from lowfat milk   
 9 Milk, lactose free, reduced fat (2%)              
10 Milk, acidophilus, reduced fat (2%)               

4.5 Trace minerals PCA

prop_var_trace_minerals <- computePropExplained("trace_minerals")
prop_var_trace_minerals
[1] 0.84276327 0.08076958 0.04614135 0.03032580

Since there are columns in the trace minerals dataset, there are principal components.

In the table below (the variability explained by the first 5 PCs), we see that the first PC explains 84% of the variability in the trace minerals data. The first two variables explain 92%.

tibble(PC = 1:length(prop_var_trace_minerals),
       `Variability explained` = round(prop_var_trace_minerals, 2),
       `Cumulative variability explained` = round(cumsum(prop_var_trace_minerals), 2)) |>
  head(5)
# A tibble: 4 × 3
     PC `Variability explained` `Cumulative variability explained`
  <int>                   <dbl>                              <dbl>
1     1                    0.84                               0.84
2     2                    0.08                               0.92
3     3                    0.05                               0.97
4     4                    0.03                               1   

The nutrient variable loadings are shown in the table below in decreasing order of loading magnitude (in absolute value).

pc_loadings <- food_grouped_fndds_log_scaled |>
  filter(nutrient_group == "trace_minerals") |>
  pluck("loadings", 1)
trace_minerals_nutrient_name <- food_grouped_fndds_log_scaled |>
  filter(nutrient_group == "trace_minerals") |>
  pluck("data", 1) |>
  colnames()

tibble(nutrient = trace_minerals_nutrient_name,
       PC1_loading = pc_loadings[, 1]) |>
  arrange(desc(abs(PC1_loading)))
# A tibble: 4 × 2
  nutrient PC1_loading
  <chr>          <dbl>
1 selenium      -0.583
2 iron          -0.543
3 zinc          -0.509
4 copper        -0.327

“Selenium” is the nutrient with the largest weight on the first PC, followed by iron and zinc, with copper having lower weight. Note that all of the weights are negative again, so we will negate the values of this PC in our summaries and downstream analyses.

The figure below shows that the fat PC1 (and PC2) also does a good job of separating the subset of foods that we see in a meaningful way. The foods with the highest values for trace minerals seem to mostly be oysters and liver meats, followed by some cereals.

plotPc1Pc2(food_grouped_fndds_log_scaled, "trace_minerals")

Figure 5: A scatterplot of PC1 vs PC2 for the trace minerals nutrients.

4.6 Carbohydrates PCA

prop_var_carbs <- computePropExplained("carbs")
prop_var_carbs
[1] 0.94359913 0.05640087

Since there are columns in the carbohydrates dataset, there are principal components. In the table below

In the table below (the variability explained by the first 5 PCs), we see that the first PC explains 94% of the variability in the carbohydrates data. The first two variables explain 100%.

tibble(PC = 1:length(prop_var_carbs),
       `Variability explained` = round(prop_var_carbs, 2),
       `Cumulative variability explained` = round(cumsum(prop_var_carbs), 2)) |>
  head(5)
# A tibble: 2 × 3
     PC `Variability explained` `Cumulative variability explained`
  <int>                   <dbl>                              <dbl>
1     1                    0.94                               0.94
2     2                    0.06                               1   

The nutrient variable loadings are shown in the table below in decreasing order of loading magnitude (in absolute value).

pc_loadings <- food_grouped_fndds_log_scaled |>
  filter(nutrient_group == "carbs") |>
  pluck("loadings", 1)
carbs_nutrient_name <- food_grouped_fndds_log_scaled |>
  filter(nutrient_group == "carbs") |>
  pluck("data", 1) |>
  colnames()

tibble(nutrient = carbs_nutrient_name,
       PC1_loading = pc_loadings[, 1]) |>
  arrange(desc(abs(PC1_loading)))
# A tibble: 2 × 2
  nutrient            PC1_loading
  <chr>                     <dbl>
1 carbohydrates            -0.847
2 total_dietary_fiber      -0.531

“carbohydrates” is the nutrient with the largest weight on the first PC, with total dietary fiber having a lesser weight. Note that all of the weights are negative again, so we will negate the values of this PC in our summaries and downstream analyses.

The figure below shows that the fat PC1 (and PC2) also does a good job of separating the subset of foods that we see in a meaningful way. The foods with the highest values for carbohydrates seem to mostly be oysters and liver meats, followed by some cereals.

plotPc1Pc2(food_grouped_fndds_log_scaled, "carbs")

Figure 6: A scatterplot of PC1 vs PC2 for the carbs nutrients.

5 PCS analysis

Next, we want to evaluate our results using the principles of PCS. Specifically, we want to show that they are predictable (that they re-emerge in external data, namely the “legacy” dataset), and that they are stable to data perturbations and the judgment calls that we made throughout the DSLC so far.

5.1 Predictability

To show that our results are predictable, we will reproduce our analysis on the alternative legacy dataset which is acting as an external validation set.

The fact that we wrote re-usable functions above makes it relatively straightforward to apply the same analysis to a new dataset.

food_grouped_legacy_log_scaled <- createNutrientGroups(food_legacy_log_scaled)

# create the list columns
food_grouped_legacy_log_scaled <- food_grouped_legacy_log_scaled |>
  rowwise() |>
  # apply PCA to each dataset and save it in a list-column called "pca"
  mutate(pca = list(svd(data))) |>
  # extract loadings from pca column
  mutate(loadings = list(pca$v)) |>
  # compute PC transformed data
  mutate(pc_x = list(set_names(as_tibble(as.matrix(data) %*% loadings),
                               paste0("PC", 1:ncol(loadings)))))

Our legacy food object now looks like the original fndds one, with each of the same list columns (except computed from the legacy data).

food_grouped_legacy_log_scaled
# A tibble: 5 × 6
# Rowwise: 
  nutrient_group data                 description pca          loadings pc_x    
  <chr>          <list>               <list>      <list>       <list>   <list>  
1 fats           <tibble>             <chr>       <named list> <dbl[…]> <tibble>
2 vitamins       <tibble>             <chr>       <named list> <dbl[…]> <tibble>
3 major_minerals <tibble [7,793 × 6]> <chr>       <named list> <dbl[…]> <tibble>
4 trace_minerals <tibble [7,793 × 4]> <chr>       <named list> <dbl[…]> <tibble>
5 carbs          <tibble [7,793 × 2]> <chr>       <named list> <dbl[…]> <tibble>

Figure 7 shows the -PC1 vs PC2 plot for the Major minerals data (the equivalent of Figure 2). We specifically filtered to just show the “pork” food items to make comparisons easier (feel free to play with this and look at other food items).

To our eyes, especially for the fats and trace minerals nutrient groups, it seems like the plot is showing much the same trends for each dataset, which provides us with some fairly solid evidence that the patterns picked up by the PCA algorithm are predictable. Less so for the major minerals dataset though.

(plotPc1Pc2(food_grouped_fndds_log_scaled, "major_minerals", .description_contains = "pork,") + 
   plotPc1Pc2(food_grouped_legacy_log_scaled, "major_minerals", .description_contains = "pork,")) / 
  (plotPc1Pc2(food_grouped_fndds_log_scaled, "vitamins", .description_contains = "pork,") + 
     plotPc1Pc2(food_grouped_legacy_log_scaled, "vitamins", .description_contains = "pork,")) / 
  (plotPc1Pc2(food_grouped_fndds_log_scaled, "fats", .description_contains = "pork,") + 
     plotPc1Pc2(food_grouped_legacy_log_scaled, "fats", .description_contains = "pork,")) / 
  (plotPc1Pc2(food_grouped_fndds_log_scaled, "trace_minerals", .description_contains = "pork,") + 
     plotPc1Pc2(food_grouped_legacy_log_scaled, "trace_minerals", .description_contains = "pork,"))

Figure 7: A scatterplot of PC1 vs PC2 for the major mineral nutrients based on the legacy validation data.

Figure 8 shows histograms of the PC1 variable for each of the two major minerals datasets (FNDDS and legacy). The overall PC1 shape is similar for each nutrient group, except for the major minerals, where the legacy PC1 values are generally higher and less variable (the peak is skinnier) than the FNDDS PC1 values.

plotPc1Histograms <- function(.nutrient_group) {
  fndds_pc1 <- food_grouped_fndds_log_scaled |> 
    filter(nutrient_group == .nutrient_group) |>
    pluck("pc_x", 1) |>
    select(PC1) |>
    mutate(dataset = "fndds")
  legacy_pc1 <- food_grouped_legacy_log_scaled |> 
    filter(nutrient_group == .nutrient_group) |>
    pluck("pc_x", 1) |>
    select(PC1) |>
    mutate(dataset = "legacy")
  
  rbind(fndds_pc1, legacy_pc1) |> ggplot() +
    geom_histogram(aes(x = PC1, y = ..density.., fill = dataset), 
                   position = "identity", col = "white", 
                   alpha = 0.5, bins = 30) +
    ggtitle(.nutrient_group)
}

# piece the plots together using the patchwork library
(plotPc1Histograms("fats") + plotPc1Histograms("vitamins")) / 
  (plotPc1Histograms("major_minerals") + plotPc1Histograms("trace_minerals"))
Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.

Figure 8: Histograms of the major minerals PC1 distribution for the FNDDS and legacy datasets.

Figure 9 shows the same information but using a QQ-plot to compare the FNDDS and legacy PC1 distributions for the major minerals dataset. Overall, the distributions seem fairly similar (they approximately follow a straight diagonal line), except that they don’t seem to have the same magnitude (the points lie above the identity diagonal line).

plotPc1QQ <- function(.nutrient_group) {
  fndds_pc1 <- food_grouped_fndds_log_scaled |> 
    filter(nutrient_group == .nutrient_group) |>
    pluck("pc_x", 1) |>
    pull(PC1)
  
  legacy_pc1 <- food_grouped_legacy_log_scaled |> 
    filter(nutrient_group == .nutrient_group) |>
    pluck("pc_x", 1) |>
    pull(PC1)
  
  
  
  
  data.frame(fndds_pc1_q = quantile(fndds_pc1, seq(0, 1, 0.01)),
             legacy_pc1_q = quantile(legacy_pc1, seq(0, 1, 0.01))) |>
    ggplot() +
    geom_point(aes(x = fndds_pc1_q, y = legacy_pc1_q), alpha = 0.7) +
    geom_abline(intercept = 0, slope = 1)  
}

# piece the plots together using the patchwork library
(plotPc1QQ("fats") + plotPc1QQ("vitamins")) / 
  (plotPc1QQ("major_minerals") + plotPc1QQ("trace_minerals"))

Figure 9: A QQ plot comparing the major minerals PC1 distribution for the FNDDS and legacy datasets. The identity diagonal line is shown as a solid line.

5.2 Stability to data perturbations

We will investigate two types of data perturbations below:

  1. Bootstrap: to represent the fact that different food items may have been included in the observed data

  2. Adding random noise: to represent inaccuracies in the nutrient measurements.

We will re-compute the principal components using 4 different versions of the training data, each of which involves a bootstrap sample of the original food items, and each of which has been modified by adding a small amount of “noise” where we add random numbers whose magnitude is up to around 20% of the observed measurement. Since our data has been scaled to have standard deviation 1, we roughly approximate such noise using random numbers drawn from a Gaussian distribution that has mean 0 and standard deviation of 0.2 multiplied by the mean of each column (to represent an error of 20%). For any resulting perturbed values that ended up being negative, we rounded them up to 0.

To investigate the plausibility of this perturbation, Figure 10 below shows some histograms that compares the distribution of the original measurements and the perturbed measurements for three nutrient variables for an example perturbed dataset. Overall the distributions are similar, which indicates that this perturbation is reasonably plausible.

set.seed(28464)

# create a perturbed version of the food dataset
perturb_ex <- food_fndds_log_scaled |>
  # add random noise to each observation (noise is equal to 0 on average)
  mutate(across(where(is.numeric), 
                function(x) x + rnorm(n(), 0, 0.2 * mean(x))))  |>
  # if any observations are negative, round them up to 0
  mutate(across(where(is.numeric), 
                function(x) if_else(x < 0, true = 0, false = x)))

# compare the distribution of the original and the perturbed nutrient values
examinePerturbed <- function(nutrient_col) {
  rbind(data.frame(source = "original",
                   value = food_fndds_log_scaled[[nutrient_col]]),
        data.frame(source = "perturbed",
                   value = perturb_ex[[nutrient_col]])) |>
    ggplot() + 
    geom_histogram(aes(x = value, y = ..density.., fill = source),
                   position = "identity", alpha = 0.6, bins = 30) +
    ggtitle(nutrient_col)
}

# arrange the plots in rows using patchwork
examinePerturbed("fat") / examinePerturbed("beta_carotene") / examinePerturbed("zinc")

Figure 10: Histograms comparing the original and perturbed distributions for three nutrient variables

Let’s create four different versions of the perturbed data and visually compare the PCA results. To compare the results, let’s just look at whether the loadings change for each dataset. The following code relies heavily on map() functions, but you could certainty manually create four perturbed versions of the data and run the analysis on each of them.

# map_df(1:4, ....) means map through the following function 4 times and save 
# the results in a dataframe
perturb_data_loadings_df <- map_df(1:4, function(i) {
  # create a perturbed version of each nutrient group dataset
  food_grouped_perturbed <- food_grouped_fndds_log_scaled |>
    select(nutrient_group, data) |>
    # undo rowwise
    ungroup() |>
    # add a data perturbed list column, consisting of a perturbed version of 
    # the original data for each nutrient group
    mutate(data_perturbed = map(data, function(.df)  {
      .df |>
        # add random noise to each observation (noise is equal to 0 on average)
        mutate(across(where(is.numeric), 
                      function(x) x + rnorm(n(), 0, 0.2 * mean(x))))  |>
        # if any observations are negative, round them up to 0
        mutate(across(where(is.numeric), 
                      function(x) if_else(x < 0, true = 0, false = x))) |>
        sample_frac(1, replace = TRUE) 
    }))
  
  # apply PCA to each perturbed dataset
  food_grouped_perturbed <- food_grouped_perturbed |>
    rowwise() |>
    mutate(pca_perturbed = list(svd(data_perturbed))) |>
    transmute(loadings_perturbed = list(tibble(nutrient_group = nutrient_group, 
                                               nutrient_variable = colnames(data),
                                               pc1_loading = pca_perturbed$v[, 1]))) |>
    unnest(loadings_perturbed)
}, .id = "iter")

The following plot shows a bar representing each of the 4 perturbed loading values for the major minerals nutrient variables. They are virtually (but not exactly) identical.

perturb_data_loadings_df |>
  filter(nutrient_group == "fats") |>
  ggplot() + 
  geom_col(aes(x = nutrient_variable, y = pc1_loading, group = iter), 
           position = "dodge", col = "white") 

perturb_data_loadings_df |>
  filter(nutrient_group == "vitamins") |>
  ggplot() + 
  geom_col(aes(x = nutrient_variable, y = pc1_loading, group = iter), 
           position = "dodge", col = "white") 

perturb_data_loadings_df |>
  filter(nutrient_group == "major_minerals") |>
  ggplot() + 
  geom_col(aes(x = nutrient_variable, y = pc1_loading, group = iter), 
           position = "dodge", col = "white") 

perturb_data_loadings_df |>
  filter(nutrient_group == "trace_minerals") |>
  ggplot() + 
  geom_col(aes(x = nutrient_variable, y = pc1_loading, group = iter), 
           position = "dodge", col = "white") 

Since the loadings are very stable, this means that all of the downstream results (such as the plots, etc) will also be very stable to the data perturbations.

5.3 Stability to judgment call perturbations

Let’s repeat our analyses using alternative data cleaning judgment calls. We will focus here on the pre-processing judgment call options to log-transform and center the data.

perturb_options <- expand_grid(center = c(TRUE, FALSE),
                               log = c(TRUE, FALSE)) |>
  mutate(perturbation_id = as.character(1:n())) |>
  mutate(perturbation = case_when(center & log ~ "log, center",
                                  center & !log ~ "center",
                                  !center & log ~ "log",
                                  !center & !log ~ "none"))


# create a data frame consisting of all of the loadings for each nutrient 
# group for each judgment call perturbation
perturbed_jc_loadings_df <- map_df(1:nrow(perturb_options), function(i) {
  # pre-process dataset using the current judgment call i
  food_preprocessed <- preProcessFoodData(food_fndds,
                                          .log_transform = perturb_options$log[i],
                                          .center = perturb_options$center[i],
                                          .scale = TRUE)
  
  # create the tibble with the nutrient group data list columns
  food_grouped_jc <- createNutrientGroups(food_preprocessed)
  
  # create the list columns
  food_grouped_jc <- food_grouped_jc |>
    rowwise() |>
    # apply PCA to each dataset and save it in a list-column called "pca"
    mutate(pca = list(svd(data))) |>
    # compute the loadings for each nutrient group
    transmute(loadings = list(data.frame(nutrient_group = nutrient_group, 
                                         nutrient_variable = colnames(data),
                                         pc1_loading = pca$v[, 1]))) |> 
    unnest(loadings)
  return(food_grouped_jc)
}, .id = "perturbation_id")

# look at the tibble we created
perturbed_jc_loadings_df
# A tibble: 204 × 4
   perturbation_id nutrient_group nutrient_variable pc1_loading
   <chr>           <chr>          <chr>                   <dbl>
 1 1               fats           butyric_acid            0.166
 2 1               fats           caproic_acid            0.172
 3 1               fats           caprylic_acid           0.131
 4 1               fats           capric_acid             0.178
 5 1               fats           lauric_acid             0.129
 6 1               fats           myristic_acid           0.230
 7 1               fats           palmitic_acid           0.314
 8 1               fats           stearic_acid            0.290
 9 1               fats           oleic_acid              0.291
10 1               fats           linoleic_acid           0.222
# ℹ 194 more rows

Let’s look at how the loadings change across the four different pre-processing perturbations.

perturbed_jc_loadings_df |>
  left_join(select(perturb_options, perturbation_id, perturbation), 
            by = "perturbation_id") |>
  filter(nutrient_group == "fats") |>
  # force the nutrients to be in decreasing order of loading
  group_by(nutrient_variable) |>
  mutate(avg_pc1_loading = mean(abs(pc1_loading))) |>
  ungroup() |>
  arrange(abs(avg_pc1_loading)) |>
  mutate(nutrient_variable = fct_inorder(nutrient_variable)) |>
  # create plot
  ggplot() +
  geom_point(aes(x = pc1_loading, y = nutrient_variable, color = perturbation)) 

Interestingly, it looks as though the loadings from the center and log,center perturbations are the approximate negatives of perturbations log and none. Let’s try this again, but with the absolute value of the loadings:

perturbed_jc_loadings_df |>
  left_join(select(perturb_options, perturbation_id, perturbation), 
            by = "perturbation_id") |>
  filter(nutrient_group == "fats") |>
  # force the nutrients to be in decreasing order of loading
  group_by(nutrient_variable) |>
  mutate(avg_pc1_loading = mean(abs(pc1_loading))) |>
  ungroup() |>
  arrange(abs(avg_pc1_loading)) |>
  mutate(nutrient_variable = fct_inorder(nutrient_variable)) |>
  # create plot
  ggplot() +
  geom_point(aes(x = abs(pc1_loading), y = nutrient_variable, color = perturbation)) 

It looks as though the overall loadings trends are fairly stable across the different judgment call perturbations, which is great. Let’s look at this for the other nutrient groups

plotPerturbedLoadings <- function(.nutrient_group) {
  perturbed_jc_loadings_df |>
    left_join(select(perturb_options, perturbation_id, perturbation), 
              by = "perturbation_id") |>
    filter(nutrient_group == .nutrient_group) |>
    # force the nutrients to be in decreasing order of loading
    group_by(nutrient_variable) |>
    mutate(avg_pc1_loading = mean(abs(pc1_loading))) |>
    ungroup() |>
    arrange(abs(avg_pc1_loading)) |>
    mutate(nutrient_variable = fct_inorder(nutrient_variable)) |>
    # create plot
    ggplot() +
    geom_point(aes(x = abs(pc1_loading), y = nutrient_variable, color = perturbation))  +
    theme(panel.grid.major.y = element_line(color = "grey80"))  +
    ggtitle(.nutrient_group)
}


(plotPerturbedLoadings("fats") + plotPerturbedLoadings("major_minerals")) / 
  (plotPerturbedLoadings("trace_minerals") + plotPerturbedLoadings("vitamins"))

Overall, the loadings trends seem fairly stable, at least in terms of rankings (i.e. which nutrients are considered most important for computing the principal component). But there are clearly some differences between these techniques.

If we have to choose just one, which one should we choose? Let’s compare the distributions of the resulting principal components.

perturbed_pc1 <- map_df(1:nrow(perturb_options), function(i) {
  # pre-process dataset
  food_preprocessed <- preProcessFoodData(food_fndds,
                                          .log_transform = perturb_options$log[i],
                                          .center = perturb_options$center[i],
                                          .scale = TRUE)
  
  # create the tibble with the nutrient group data list columns
  food_grouped_perturbed <- createNutrientGroups(food_preprocessed)
  
  # create the list columns
  food_grouped_perturbed <- food_grouped_perturbed |>
    rowwise() |>
    # apply PCA to each dataset and save it in a list-column called "pca"
    mutate(pca = list(svd(data))) |>
    mutate(loadings = list(pca$v)) |> 
    mutate(pc_x = list(set_names(as_tibble(as.matrix(data) %*% loadings),
                                 paste0("PC", 1:ncol(loadings))))) |>
    transmute(pc1 = list(tibble(nutrient_group = nutrient_group,
                                description = description,
                                PC1 = pc_x$PC1))) |>
    unnest(pc1)
  return(food_grouped_perturbed)
}, .id = "perturbation_id")

Based on the figure below, which shows histograms of the perturbed PC1 distribution for each of the four nutrient groups we have been exploring for each of the four pre-processing judgment call combinations, it appears that the log-transformation judgment call is the one that has the most impact on the results, especially for the major minerals and vitamins nutrient groups. Specifically when we log-transform the data, the principal components seem to a less skewed distribution (the blank space in several of the plots implies that there are some extreme values present). But whether or not this is preferable is unclear.

plotPerturbedPc1 <- function(.nutrient_group) {
  perturbed_pc1 |> 
    left_join(select(perturb_options, perturbation_id, perturbation), 
              by = "perturbation_id") |>
    filter(nutrient_group == .nutrient_group) |>
    ggplot() +
    geom_histogram(aes(x = PC1, y = ..density..), col = "white") +
    facet_wrap(~perturbation, scale = "free") +
    ggtitle(.nutrient_group)
}


(plotPerturbedPc1("fats") + plotPerturbedPc1("major_minerals")) / 
  (plotPerturbedPc1("trace_minerals") + plotPerturbedPc1("vitamins"))

6 Creating the “final” PCA-based summary dataset

In a situation such as this, where there are differences in the results, but it is not clear which set of judgment calls is the “best”, we recommend asking a domain professional which result (in this case, which PC-based summary variables) seems most sensible to them.

However, in the absence of a domain professional, we will choose the version that makes most sense for our downstream application. Since we want a measurement of 0 to actually correspond to an absence of that particular type of nutrient, it probably makes sense to use one of the non-centered dataset which yields summary variables with negative and positive values. Since the log-transformed PC variables seem to have values that are more spread out, we will go with the log-transformed dataset but we will need to negate each of the PC-derived variables so that the resulting values are positive (i.e., by multiplying each PC variable by -1).

food_grouped_final <- createNutrientGroups(food_fndds_log_scaled)

# create the list columns, and conduct the PCA analysis to create the PC 
# transformed columns
food_grouped_final <- food_grouped_final |>
  rowwise() |>
  # apply PCA to each dataset and save it in a list-column called "pca"
  mutate(pca = list(svd(data))) |>
  # extract loadings from pca column
  mutate(loadings = list(pca$v)) |>
  # compute PC transformed data
  mutate(pc_x = list(set_names(as_tibble(as.matrix(data) %*% loadings),
                               paste0("PC", 1:ncol(loadings)))))
 

# create a simple data frame consisting of the PC1 transformed variable for each nutrient group
food_pc_final <- food_grouped_final |>
  rowwise() |>
  transmute(final_data = list(tibble(nutrient_group = nutrient_group,
                                     description = description,
                                     PC1 = -pc_x$PC1))) |>
  unnest(final_data) |>
  pivot_wider(names_from = "nutrient_group", values_from = "PC1") |>
  # add the protein and calories variables 
  # (scaled, but not log-transformed or centered)
  mutate(protein = food_fndds$protein / sd(food_fndds$protein),
         calories = food_fndds$calories / sd(food_fndds$protein)) |>
  # make it explicit which variables are based on PCs
  rename(fats_pc1 = fats,
         vitamins_pc1 = vitamins,
         major_minerals_pc1 = major_minerals,
         trace_minerals_pc1 = trace_minerals,
         carbs_pc1 = carbs)
set.seed(287643)
food_pc_final |> 
  mutate(description = str_trunc(description, 25)) |>
  mutate(across(where(is.numeric), ~round(., 2))) |>
  sample_n(10)
# A tibble: 10 × 8
   description       fats_pc1 vitamins_pc1 major_minerals_pc1 trace_minerals_pc1
   <chr>                <dbl>        <dbl>              <dbl>              <dbl>
 1 Beef, shortribs,…     6.6          3.82              10.3                4.84
 2 Beans, string, c…     1.91         5.18               8.82               0.99
 3 Butterscotch har…     3.23         0.77               3.19               0.33
 4 Grapefruit juice…     0.11         2.17               6.78               0.32
 5 Pasta, whole gra…     5.16         5.35               9.45               3.46
 6 Yogurt, whole mi…     3.09         2                  8.82               1.08
 7 Pork, tenderloin…     7.27         5.73              11.2                4.23
 8 Peas, cowpeas, f…     0.67         4.35              10.4                3.11
 9 Topping, dietetic     0            0                  1.32               0.2 
10 Rice, brown, wit…     2.15         5.19               8.82               1.96
# ℹ 3 more variables: carbs_pc1 <dbl>, protein <dbl>, calories <dbl>

If desired to simplify interpretation, we can convert each of these summary nutrient variables into a categorical “very low”, “low”, “high”, “very high”, etc by specifying cutoffs for each category.

quantile_cutoffs <- food_pc_final |>
  pivot_longer(2:ncol(food_pc_final), 
               names_to = "nutrient_group", values_to = "amount") |>
  group_by(nutrient_group) |>
  summarise(q25 = quantile(amount, 0.25),
            q50 = quantile(amount, 0.5),
            q75 = quantile(amount, 0.75)) 

The following code defines a crude function for summarizing food items in terms of these summary nutrients using the 0.25, 0.5, and 0.75 quantiles as cutoffs.

nutritionSummary <- function(.food_item) {
  food_pc_final |>
    mutate_if(is.numeric, function(.var) {
      case_when(.var > quantile(.var, 0.75) ~ "very high",
                .var > quantile(.var, 0.5) & .var <= quantile(.var, 0.75) ~ "high",
                .var > quantile(.var, 0.25) & .var <= quantile(.var, 0.5) ~ "low",
                .var <= quantile(.var, 0.25) ~ "very low")
    }) |>
    filter(description == .food_item)
  
}

Let’s look at the nutrition summary for Nachos with cheese and sour cream

nutritionSummary("Nachos with cheese and sour cream")
# A tibble: 1 × 8
  description        fats_pc1 vitamins_pc1 major_minerals_pc1 trace_minerals_pc1
  <chr>              <chr>    <chr>        <chr>              <chr>             
1 Nachos with chees… very hi… high         very high          high              
# ℹ 3 more variables: carbs_pc1 <chr>, protein <chr>, calories <chr>

Boiled potatoes:

nutritionSummary("Potato, boiled, from fresh, peel not eaten, made with margarine")
# A tibble: 1 × 8
  description        fats_pc1 vitamins_pc1 major_minerals_pc1 trace_minerals_pc1
  <chr>              <chr>    <chr>        <chr>              <chr>             
1 Potato, boiled, f… low      low          low                very low          
# ℹ 3 more variables: carbs_pc1 <chr>, protein <chr>, calories <chr>

A pasta dish:

nutritionSummary("Pasta, whole grain, with tomato-based sauce and added vegetables, ready-to-heat")
# A tibble: 1 × 8
  description        fats_pc1 vitamins_pc1 major_minerals_pc1 trace_minerals_pc1
  <chr>              <chr>    <chr>        <chr>              <chr>             
1 Pasta, whole grai… low      very high    low                high              
# ℹ 3 more variables: carbs_pc1 <chr>, protein <chr>, calories <chr>

As an exercise, create a visualization that would give you a sense of how an individual food item of your choice fares in each nutrient category, i.e., how “nutritionally balanced” your selected food item is.